Hybrid Selection of Language Model Training Data Using Linguistic Information and Perplexity

Author

  • Antonio Toral
Abstract

We explore the selection of training data for language models using perplexity. We introduce three novel models that make use of linguistic information and evaluate them on three different corpora and two languages. In four out of the six scenarios a linguistically motivated method outperforms the purely statistical state-of-the-art approach. Finally, a method which combines surface forms and the linguistically motivated methods outperforms the baseline in all the scenarios, selecting data whose perplexity is between 3.49% and 8.17% (depending on the corpus and language) lower than that of the baseline.
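As a rough illustration of the general idea of perplexity-based data selection, the sketch below ranks candidate sentences by their perplexity under a model trained on in-domain text and keeps the lowest-scoring ones. It is not the method evaluated in the paper: the add-one-smoothed unigram model is a deliberately simple stand-in, and the names `train_unigram`, `perplexity`, and `select_by_perplexity` are hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Build an add-one-smoothed unigram model (a toy stand-in for a real LM)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen tokens

    def prob(tok):
        return (counts[tok] + 1) / (total + vocab)

    return prob


def perplexity(prob, sentence):
    """Per-token perplexity of a whitespace-tokenised sentence."""
    toks = sentence.split()
    if not toks:
        return float("inf")  # empty sentences are never selected first
    log_sum = sum(math.log(prob(t)) for t in toks)
    return math.exp(-log_sum / len(toks))


def select_by_perplexity(in_domain, candidates, keep):
    """Return the `keep` candidates with the lowest in-domain perplexity."""
    prob = train_unigram(in_domain)
    ranked = sorted(candidates, key=lambda s: perplexity(prob, s))
    return ranked[:keep]


if __name__ == "__main__":
    in_domain = ["the court ruled on the appeal", "the judge heard the appeal"]
    candidates = [
        "the judge ruled on the appeal",
        "bananas taste sweet today",
        "the court heard bananas",
    ]
    for sentence in select_by_perplexity(in_domain, candidates, keep=2):
        print(sentence)
```

A linguistically informed variant in the spirit of the abstract would score some other representation of each sentence instead of its surface forms; which representations the paper uses is not stated in this excerpt.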


Similar resources

Neural Network Language Models for Candidate Scoring in Hybrid Multi-System Machine Translation

This paper presents the comparison of how using different neural network based language modelling tools for selecting the best candidate fragments affects the final output translation quality in a hybrid multi-system machine translation setup. Experiments were conducted by comparing perplexity and BLEU scores on common test cases using the same training data set. A 12-gram statistical language ...


Adaptive Hybrid POS Cache based Semantic Language Model

This paper presents a language model as an improvement over the stochastic language model for developing a syntactic structure based on word dependencies in local and non local domain. The model copes with the issues of limited amount of training material and the exploitation of the linguistic constraints of the language. The proposed model is a dynamic probabilistic model which uses word depen...


Selection Criteria for Word Trigger Pairs in Language

In this paper, we study selection criteria for the use of word trigger pairs in statistical language modeling. A word trigger pair is defined as a long-distance word pair. To select the most significant trigger pairs, we need suitable criteria which are the topics of this paper. We extend a baseline language model by a single word trigger pair and use the perplexity of this extended language mode...


Method of Selecting Training Sets to Build Compact and Efficient Statistical Language Model

For statistical language model training, corpora matched to the target task are required. However, training corpora sometimes include both sentences that match the target task and sentences that do not. In such a case, training set selection is effective for both model size reduction and model performance improvement. In this paper, a training set selection method for statistical language model training is described. The ...


Joint and Coupled Bilingual Topic Model Based Sentence Representations for Language Model Adaptation

This paper is concerned with data selection for adapting the language model (LM) in statistical machine translation (SMT), and aims to find the LM training sentences that are topically similar to the translation task. Although the traditional approaches have achieved significant performance gains, they ignore the topic information and the distribution information of words when selecting similar training senten...




Publication date: 2013